Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics

نویسندگان

  • Anne-Laure Boulesteix
  • Silke Janitza
  • Jochen Kruppa
  • Inke R. König
چکیده

The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research. ∗Corresponding author. Email: [email protected].

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Computational prediction of miRNAs in Nipah virus genome reveals possible interaction with human genes involved in encephalitis

Current re-emergence of Nipah virus (NiV) in India caused 11 deaths so far and many patients were kept in quarantine. A thorough study of previous outbreaks occurred in Malaysia, Bangladesh and India represents cases with high rate of fatality due to acute encephalitis. Our work involves genome analysis of NiV for prediction of miRNAs and their targeted genes in human in order to understand enc...

متن کامل

Random Forest for Bioinformatics

Modern biology has experienced an increasing use of machine learning techniques for large scale and complex biological data analysis. In the area of Bioinformatics, the Random Forest (RF) [6] technique, which includes an ensemble of decision trees and incorporates feature selection and interactions naturally in the learning process, is a popular choice. It is nonparametric, interpretable, effic...

متن کامل

Finding Exact and Solo LTR-Retrotransposons in Biological Sequences Using SVM

Finding repetitive subsequences in genome is a challengeable problem in bioinformatics research area. A lot of approaches have been proposed to solve the problem, which could be divided to library base and de novo methods. The library base methods use predetermined repetitive genome’s subsequences, where library-less methods attempt to discover repetitive subsequences by analytical approach...

متن کامل

Classification of Large microarray Datasets Using Fast Random Forest Construction

Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory po...

متن کامل

The Importance of α-CT and Salt bridges in the Formation of Insulin and its Receptor Complex by Computational Simulation

Insulin hormone is an important part of the endocrine system. It contains two polypeptide chains and plays a pivotal role in regulating carbohydrate metabolism. Insulin receptors (IR) located on cell surface interacts with insulin to control the intake of glucose. Although several studies have tried to clarify the interaction between insulin and its receptor, the mechanism of this interaction r...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery

دوره 2  شماره 

صفحات  -

تاریخ انتشار 2012